Skip to content

[Improve](variant) Keep first duplicate Variant JSON path#63082

Merged
eldenmoon merged 2 commits into
apache:masterfrom
eldenmoon:codex/cir-19941-duplicate-json-path
May 12, 2026
Merged

[Improve](variant) Keep first duplicate Variant JSON path#63082
eldenmoon merged 2 commits into
apache:masterfrom
eldenmoon:codex/cir-19941-duplicate-json-path

Conversation

@eldenmoon
Copy link
Copy Markdown
Member

Add BE config variant_enable_duplicate_json_path_check to keep the first duplicate JSON object key during Variant parsing when enabled.

Copilot AI review requested due to automatic review settings May 8, 2026 10:21
@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a backend config to change how VARIANT parsing handles duplicate JSON object keys: when enabled, it keeps the first occurrence instead of erroring on duplicate paths. It also adds test coverage for this behavior, while removing several existing VARIANT nested regression suites and adding a new export/load regression for wide VARIANT data.

Changes:

  • Add BE config variant_enable_duplicate_json_path_check (default off) and implement “keep first duplicate key” behavior during JSON traversal for VARIANT extraction.
  • Add BE unit tests and a VARIANT regression test + data file to validate duplicate-key behavior (insert + stream load).
  • Add an export/load regression test for VARIANT with many potential subcolumns; remove multiple VARIANT nested regression suites.

Reviewed changes

Copilot reviewed 11 out of 12 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
regression-test/suites/variant_p0/nested2.groovy Deleted an existing VARIANT nested-type conflict regression suite.
regression-test/suites/variant_p0/nested/sql/q01.sql Deleted SQL outputs for a removed nested VARIANT load suite.
regression-test/suites/variant_p0/nested/nested_in_top_array.groovy Deleted a regression suite covering nested-in-top-array VARIANT behavior.
regression-test/suites/variant_p0/nested/load.groovy Deleted a (mostly commented) nested VARIANT load regression suite.
regression-test/suites/variant_p0/nested.groovy Deleted a large VARIANT nested regression suite.
regression-test/suites/variant_p0/duplicate_json_path.groovy Added regression coverage for keeping the first duplicate JSON key when the new BE config is enabled.
regression-test/suites/export_p2/test_export_variant_10k_columns.groovy Added export→S3→reload regression for wide/sparse VARIANT; currently has correctness and cleanup issues.
regression-test/data/variant_p0/duplicate_json_path.json Added stream-load input with duplicate JSON keys.
be/test/storage/segment/variant_util_test.cpp Added unit tests validating “keep first duplicate key” behavior in both subcolumn-only and doc-mode parsing.
be/src/util/json/json_parser.cpp Added duplicate-key skipping logic during object traversal when the config is enabled.
be/src/common/config.h Declared the new BE config flag.
be/src/common/config.cpp Defined the new BE config flag with default false.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread be/src/util/json/json_parser.cpp Outdated
Comment on lines 111 to 132
if (!config::variant_enable_duplicate_json_path_check) {
for (auto it = object.begin(); it != object.end(); ++it) {
const auto& [key, value] = *it;
traverse_object_member(key, value);
}
return;
}

std::unordered_set<PathInData, PathInData::Hash> visited_member_paths;
visited_member_paths.reserve(object.size());
for (auto it = object.begin(); it != object.end(); ++it) {
const auto& [key, value] = *it;
check_key_length(key);
ctx.builder.append(key, false);
PathInData member_path(ctx.builder.get_parts());
if (!visited_member_paths.insert(std::move(member_path)).second) {
ctx.builder.pop_back();
continue;
}
traverse(value, ctx);
ctx.builder.pop_back();
}
Comment on lines +18 to +22
suite("duplicate_json_path", "p0") {
def customBeConfig = [
variant_enable_duplicate_json_path_check: true
]
setBeConfigTemporary(customBeConfig) {
@eldenmoon eldenmoon force-pushed the codex/cir-19941-duplicate-json-path branch from a7bc101 to abc6911 Compare May 8, 2026 10:42
@eldenmoon
Copy link
Copy Markdown
Member Author

run buildall

@eldenmoon
Copy link
Copy Markdown
Member Author

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found one additional blocking correctness issue beyond the already-known review threads. The enabled duplicate-path check still allows flat dotted keys and nested object paths to produce the same downstream Variant path string, so the load/WAL failure this PR is meant to prevent can still occur for inputs such as {"a.b":1,"a":{"b":2}}.

Critical checkpoint conclusions:

  • Goal/test: The goal is to optionally keep the first duplicate Variant JSON path and avoid duplicate/conflicting subcolumn paths; current tests cover repeated object member names but not dotted-key versus nested-path conflicts, and the implementation does not fully accomplish the goal.
  • Scope: The change is small and focused, but the local duplicate key check is too narrow for Variant path identity.
  • Concurrency/lifecycle: No new concurrency or special lifecycle management is introduced; the mutable config follows existing BE config patterns.
  • Configuration: A disabled-by-default mutable BE config is added; behavior is observed at parse time.
  • Compatibility: No storage/protocol format change was found.
  • Parallel paths: Existing review thread already covers the JSONB conversion path separately; this comment covers a distinct subcolumn/doc-value path conflict.
  • Tests: Added unit/regression coverage misses the dotted-key conflict case; no tests were run by me in this automated review.
  • Observability/performance: No additional observability requirement found; the extra check is not on an obviously problematic hot path when disabled.
  • User focus: No additional user-provided review focus was supplied.

Comment thread be/src/util/json/json_parser.cpp Outdated
@eldenmoon eldenmoon force-pushed the codex/cir-19941-duplicate-json-path branch 4 times, most recently from 961fd01 to 99aa5b6 Compare May 9, 2026 07:00
@eldenmoon
Copy link
Copy Markdown
Member Author

run buildall

@eldenmoon
Copy link
Copy Markdown
Member Author

/review

### What problem does this PR solve?

Issue Number: CIR-19941

Related PR: None

Problem Summary: Variant JSON parsing could emit duplicate or conflicting downstream subcolumn paths when duplicate object keys, flat dotted keys, nested object paths, or array/scalar leaf values normalize to the same Variant path. This could later fail load or WAL replay with subcolumn type size mismatch. Add a disabled-by-default BE config to enable duplicate JSON path checking; when enabled, duplicate leaf paths keep the first value while object entries that expand to distinct child paths are preserved. The duplicate-check setting is captured in ParseConfig so parser and materialization use the same per-call behavior.

### Release note

Add BE config variant_enable_duplicate_json_path_check to keep the first duplicate Variant JSON leaf path when enabled; default is false.

### Check List (For Author)

- Test: Unit Test and Regression test
    - Unit Test: ./run-be-ut.sh --run --filter='VariantUtilTest.ParseDuplicateJsonPaths*'
    - Build: source ./custom_env.sh; export DORIS_PARALLELISM="${DORIS_PARALLELISM:-39}"; export BUILD_TYPE="${BUILD_TYPE:-ASAN}"; export USE_MEM_TRACKER=ON; ./build.sh --be -j"${DORIS_PARALLELISM}"
    - Regression test: env -u http_proxy -u https_proxy -u HTTP_PROXY -u HTTPS_PROXY -u all_proxy -u ALL_PROXY NO_PROXY=127.0.0.1,localhost,172.19.0.1 no_proxy=127.0.0.1,localhost,172.19.0.1 ./run-regression-test.sh --run --conf tmp/regression-conf.auto.groovy -d variant_p0 -s duplicate_json_path
    - Review: GPT-5.5 xhigh subagent final review found no P0/P1 blocking issues
- Behavior changed: Yes. When variant_enable_duplicate_json_path_check is enabled, duplicate downstream leaf paths keep the first value and duplicate object entries can merge distinct child paths; default is false.
- Does this need documentation: No
@eldenmoon eldenmoon force-pushed the codex/cir-19941-duplicate-json-path branch from 99aa5b6 to 102aad0 Compare May 9, 2026 08:15
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review conclusion: request changes. I did not add duplicate inline comments because the remaining blocking concern is already covered by an existing thread on be/src/util/json/json_parser.cpp: traverseObjectAsJsonb() still preserves duplicate object keys inside JSONB output when variant_enable_duplicate_json_path_check is enabled, so nested arrays/objects converted through the JSONB path do not consistently keep the first duplicate key.

Critical checkpoints:

  • Goal/test: The PR aims to make duplicate Variant JSON paths keep the first value when enabled. The added BE unit and regression tests cover subcolumn/doc materialization, flat dotted keys, object/scalar conflicts, insert, stream load, and compaction, but the JSONB conversion path remains uncovered and inconsistent with the config comment.
  • Scope/focus: The code is mostly focused on Variant parsing and materialization; no unrelated current PR file changes were found in the GitHub PR diff.
  • Concurrency/lifecycle: No new concurrent state or non-trivial lifecycle management is introduced. The new mutable config is read per parse/materialization call into ParseConfig; no lock-order or thread lifecycle issue was found.
  • Configuration: A mutable BE config is added and propagated to the main deserialize/materialize paths, but behavior is not applied uniformly to JSONB serialization of nested arrays/objects.
  • Compatibility/storage: No serialized format or FE/BE protocol format change was found. Default behavior remains disabled, so compatibility risk is limited.
  • Parallel paths: The subcolumn/doc-value path is updated, but the JSONB path remains a parallel path with inconsistent duplicate handling.
  • Tests/results: Tests are meaningful for the updated paths and use ordered query output. Missing coverage remains for JSONB duplicate-key handling under the new config.
  • Observability/performance: No extra observability appears necessary. The duplicate path checks add per-row hash-set work only when enabled; no additional performance blocker found.

User focus: no additional user-provided review focus was specified.

Existing review context: The earlier flat-dot/nested path-collision concern appears addressed in the current head by parser-level get_path() deduplication plus path normalization/materialization tests. The earlier regression-test deletion concern appears stale for the current GitHub PR file list. The JSONB duplicate-key concern remains applicable and should be resolved before approval.

@eldenmoon
Copy link
Copy Markdown
Member Author

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-H: Total hot run time: 29950 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 102aad0ae5ef513aadeb242cd64a157741da5c17, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17592	3882	3850	3850
q2	q3	10688	879	619	619
q4	4687	460	343	343
q5	7660	1351	1169	1169
q6	276	167	137	137
q7	932	963	754	754
q8	10721	1458	1329	1329
q9	7169	5396	5353	5353
q10	6302	2083	1845	1845
q11	474	264	259	259
q12	668	420	295	295
q13	18140	3349	2777	2777
q14	292	281	256	256
q15	q16	900	872	795	795
q17	981	1112	758	758
q18	6618	5814	5725	5725
q19	1386	1294	1201	1201
q20	542	411	267	267
q21	4756	2360	1895	1895
q22	450	371	323	323
Total cold run time: 101234 ms
Total hot run time: 29950 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4765	4622	4622	4622
q2	q3	4671	4857	4220	4220
q4	2128	2165	1398	1398
q5	4994	5023	5320	5023
q6	201	171	130	130
q7	2047	1787	1667	1667
q8	3491	3108	3321	3108
q9	8403	8428	8454	8428
q10	4503	4558	4272	4272
q11	608	412	392	392
q12	721	744	509	509
q13	3260	3572	2946	2946
q14	303	307	278	278
q15	q16	801	792	704	704
q17	1353	1310	1287	1287
q18	8008	6965	7180	6965
q19	1210	1162	1173	1162
q20	2239	2196	1921	1921
q21	6116	5335	4818	4818
q22	525	471	401	401
Total cold run time: 60347 ms
Total hot run time: 54251 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-DS: Total hot run time: 170427 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 102aad0ae5ef513aadeb242cd64a157741da5c17, data reload: false

query5	4307	654	526	526
query6	327	225	201	201
query7	4296	560	314	314
query8	344	239	233	233
query9	8862	4049	4044	4044
query10	453	356	313	313
query11	5810	2448	2209	2209
query12	193	137	133	133
query13	1303	595	441	441
query14	6283	5396	5065	5065
query14_1	4397	4418	4340	4340
query15	213	203	181	181
query16	1016	478	427	427
query17	1118	747	618	618
query18	2758	482	352	352
query19	228	206	161	161
query20	144	134	129	129
query21	228	145	119	119
query22	13711	13665	13282	13282
query23	17226	16368	16117	16117
query23_1	16202	16135	16158	16135
query24	7349	1747	1356	1356
query24_1	1357	1346	1358	1346
query25	561	495	438	438
query26	1168	322	169	169
query27	2699	602	337	337
query28	4382	1992	1951	1951
query29	978	661	544	544
query30	313	244	202	202
query31	1125	1082	959	959
query32	102	79	75	75
query33	578	359	320	320
query34	1203	1129	625	625
query35	780	793	690	690
query36	1321	1368	1219	1219
query37	157	114	96	96
query38	3213	3143	3030	3030
query39	936	938	896	896
query39_1	872	875	876	875
query40	244	171	145	145
query41	71	69	71	69
query42	118	113	110	110
query43	324	338	288	288
query44	
query45	217	207	197	197
query46	1049	1215	737	737
query47	2315	2365	2234	2234
query48	407	422	307	307
query49	667	568	452	452
query50	716	284	220	220
query51	4417	4312	4256	4256
query52	106	108	95	95
query53	265	300	210	210
query54	331	289	282	282
query55	95	93	86	86
query56	316	324	328	324
query57	1404	1407	1298	1298
query58	314	277	271	271
query59	1583	1619	1485	1485
query60	339	334	320	320
query61	161	159	159	159
query62	666	625	560	560
query63	240	201	210	201
query64	2346	803	659	659
query65	
query66	1698	526	396	396
query67	30031	29987	29849	29849
query68	
query69	473	339	297	297
query70	994	1047	1015	1015
query71	311	283	271	271
query72	2886	2813	1982	1982
query73	864	794	422	422
query74	5098	4929	4738	4738
query75	2769	2678	2336	2336
query76	2282	1144	765	765
query77	423	418	360	360
query78	12885	13031	12423	12423
query79	1512	980	695	695
query80	1335	574	506	506
query81	491	278	240	240
query82	1300	160	123	123
query83	352	289	258	258
query84	263	145	111	111
query85	937	491	439	439
query86	446	339	344	339
query87	3435	3365	3242	3242
query88	3529	2664	2667	2664
query89	445	377	345	345
query90	1906	195	176	176
query91	177	168	136	136
query92	78	76	68	68
query93	982	957	567	567
query94	731	335	309	309
query95	669	458	364	364
query96	1035	764	347	347
query97	2725	2680	2571	2571
query98	238	227	233	227
query99	1138	1128	993	993
Total cold run time: 254426 ms
Total hot run time: 170427 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

BE UT Coverage Report

Increment line coverage 92.65% (63/68) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.53% (20635/38548)
Line Coverage 37.17% (194838/524170)
Region Coverage 33.48% (151809/453373)
Branch Coverage 34.55% (66265/191806)

@eldenmoon eldenmoon changed the title [fix](variant) Keep first duplicate Variant JSON path [Improve](variant) Keep first duplicate Variant JSON path May 11, 2026
Comment thread be/src/exec/common/variant_util.cpp Outdated
const auto& path_name = path.get_path();
for (auto& entry : column_variant.get_subcolumns()) {
if (entry->path.get_path() == path_name && is_plain_path(entry->path)) {
subcolumn = &entry->data;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

如果 check_duplicate_json_path 是true 的时候,path 都是normalized_plain_path, 这个分支 这个不可能走到。

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deleted code

Comment thread be/src/util/json/json_parser.h Outdated
void traverse(const Element& element, ParseContext& ctx);
void traverseObject(const JSONObject& object, ParseContext& ctx);
void traverseArray(const JSONArray& array, ParseContext& ctx);
bool tryAppendValue(ParseContext& ctx, const PathInData::Parts& path, Field&& value,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个函数的返回值似乎没被用到过,如果tryAppendValue 失败会怎么样

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

只保留其中一个, 第一个

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -0,0 +1,7 @@
{"k":8,"v":{"a":42,"a":{"b":42}}}
{"k":9,"v":{"a":123,"a":"123"}}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个设置 variant_enable_duplicate_json_path_check 是true 了,还是有两个 a ,这个变量variant_enable_duplicate_json_path_check为true ,期望报错还是默认丢弃一个还是其他。

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

只保留其中一个, 第一个

### What problem does this PR solve?

Issue Number: CIR-19941

Related PR: apache#63082

Problem Summary: Address review comments around duplicate Variant JSON path handling. The parser helper now has no unused return value and clearly skips duplicate paths to keep the first value. The materialization path no longer carries an unreachable duplicate-path fallback after path normalization, and the regression test documents the keep-first expectation for duplicate keys.

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - Unit Test: ./run-be-ut.sh --run --filter=VariantUtilTest.*
    - Format: /mnt/disk1/claude-max/ldb_toolchain16/bin/clang-format-16 -i be/src/util/json/json_parser.h be/src/util/json/json_parser.cpp be/src/exec/common/variant_util.cpp
- Behavior changed: No
- Does this need documentation: No
@eldenmoon
Copy link
Copy Markdown
Member Author

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-H: Total hot run time: 29703 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit c1b13b6da1b5721a403ec0cff1b1c48f892e5ff1, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17648	3950	3853	3853
q2	q3	10699	890	621	621
q4	4675	461	349	349
q5	7459	1342	1152	1152
q6	187	177	144	144
q7	923	940	771	771
q8	9334	1418	1284	1284
q9	5517	5410	5294	5294
q10	6243	2090	1859	1859
q11	472	266	256	256
q12	634	410	293	293
q13	18074	3432	2789	2789
q14	290	281	259	259
q15	q16	899	864	789	789
q17	1011	995	745	745
q18	6488	5659	5527	5527
q19	1307	1144	1160	1144
q20	504	397	264	264
q21	4931	2391	1966	1966
q22	494	403	344	344
Total cold run time: 97789 ms
Total hot run time: 29703 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4772	4724	4906	4724
q2	q3	4657	4777	4180	4180
q4	2135	2174	1401	1401
q5	4984	5000	5294	5000
q6	195	173	138	138
q7	2033	1777	1571	1571
q8	3359	3082	3078	3078
q9	8352	8378	8453	8378
q10	4474	4487	4272	4272
q11	623	430	460	430
q12	717	770	548	548
q13	3302	3594	2971	2971
q14	302	304	284	284
q15	q16	783	793	832	793
q17	1403	1365	1344	1344
q18	8040	7167	7225	7167
q19	1203	1151	1154	1151
q20	2227	2231	1977	1977
q21	6298	5558	5025	5025
q22	563	529	423	423
Total cold run time: 60422 ms
Total hot run time: 54855 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-DS: Total hot run time: 172174 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit c1b13b6da1b5721a403ec0cff1b1c48f892e5ff1, data reload: false

query5	4299	652	541	541
query6	341	233	203	203
query7	4239	589	304	304
query8	313	228	223	223
query9	8837	4099	4051	4051
query10	450	332	300	300
query11	5805	2396	2256	2256
query12	181	133	126	126
query13	1266	598	443	443
query14	6050	5363	5150	5150
query14_1	4408	4408	4390	4390
query15	211	200	178	178
query16	962	433	429	429
query17	1039	737	612	612
query18	2431	488	354	354
query19	212	196	163	163
query20	140	136	134	134
query21	211	141	120	120
query22	13582	14000	14439	14000
query23	17183	16543	16129	16129
query23_1	16359	16293	16202	16202
query24	7505	1746	1367	1367
query24_1	1365	1363	1355	1355
query25	555	492	432	432
query26	1319	323	166	166
query27	2712	619	344	344
query28	4459	1983	1972	1972
query29	986	634	506	506
query30	292	242	196	196
query31	1096	1057	935	935
query32	82	72	74	72
query33	518	347	320	320
query34	1141	1147	622	622
query35	769	777	672	672
query36	1291	1326	1096	1096
query37	143	102	82	82
query38	3216	3135	3064	3064
query39	918	918	881	881
query39_1	901	870	862	862
query40	234	157	136	136
query41	63	61	59	59
query42	111	111	111	111
query43	337	331	289	289
query44	
query45	209	196	192	192
query46	1071	1202	722	722
query47	2317	2285	2180	2180
query48	411	426	306	306
query49	633	536	424	424
query50	715	290	218	218
query51	4242	4306	4216	4216
query52	104	106	96	96
query53	251	291	206	206
query54	322	277	266	266
query55	97	89	88	88
query56	315	302	305	302
query57	1427	1373	1299	1299
query58	304	273	270	270
query59	1546	1604	1387	1387
query60	369	336	324	324
query61	167	158	156	156
query62	664	618	558	558
query63	247	202	203	202
query64	2456	833	734	734
query65	
query66	1784	539	409	409
query67	29904	30019	29771	29771
query68	
query69	467	342	316	316
query70	968	995	1007	995
query71	323	289	283	283
query72	3204	2547	2645	2547
query73	822	746	396	396
query74	5063	4931	4743	4743
query75	2823	2657	2329	2329
query76	2290	1157	750	750
query77	424	428	355	355
query78	13065	12935	12340	12340
query79	1555	996	731	731
query80	1353	561	507	507
query81	526	280	235	235
query82	987	164	121	121
query83	361	276	248	248
query84	275	141	108	108
query85	908	530	437	437
query86	453	329	340	329
query87	3395	3360	3222	3222
query88	3627	2679	2671	2671
query89	440	380	336	336
query90	1991	184	173	173
query91	181	170	140	140
query92	82	77	75	75
query93	1109	962	563	563
query94	729	335	314	314
query95	657	474	352	352
query96	1032	804	354	354
query97	2704	2669	2592	2592
query98	238	233	221	221
query99	1114	1112	975	975
Total cold run time: 254144 ms
Total hot run time: 172174 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 96.55% (56/58) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.77% (27860/37768)
Line Coverage 57.61% (301275/522981)
Region Coverage 54.70% (250560/458041)
Branch Coverage 56.29% (108462/192684)

@eldenmoon
Copy link
Copy Markdown
Member Author

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found one blocking correctness issue in the new duplicate path handling. The PR goal is clear and mostly focused, and it adds both BE unit coverage and a regression case for duplicate object keys, flat/nested path conflicts, doc mode, stream load, and compaction. However, array aggregation still has a distinct flat-key/nested-path collision path that can silently drop values when the duplicate check is enabled.

Critical checkpoint conclusions:

  • Goal/test: The PR aims to keep the first duplicate Variant JSON path when enabled; current tests cover object-level duplicates but miss duplicate logical paths produced by different elements of the same array.
  • Scope: The implementation is small and focused, but the array path handling needs to be included for correctness.
  • Concurrency/lifecycle: No new shared mutable lifecycle or locking concerns beyond reading the mutable BE config per parse.
  • Configuration: The new BE config is mutable and is read at parse/materialization entry points; behavior remains disabled by default.
  • Compatibility: No storage-format or thrift incompatibility found.
  • Parallel paths: Subcolumn and doc-value paths are both wired, but the array aggregation path remains inconsistent.
  • Tests: Existing new tests are useful but need an array case like {"x":[{"a.b":1},{"a":{"b":2}}]} to prove values are merged rather than dropped.
  • Observability/performance/transactions: No additional observability, transaction, or obvious performance concerns found for this focused change.

User focus: No additional user-provided review focus was supplied.

Comment thread be/src/util/json/json_parser.cpp
@github-actions
Copy link
Copy Markdown
Contributor

PR approved by at least one committer and no changes requested.

@github-actions github-actions Bot added the approved Indicates a PR has been approved by one committer. label May 12, 2026
@eldenmoon eldenmoon added the usercase Important user case type label label May 12, 2026
@eldenmoon eldenmoon merged commit f563b2e into apache:master May 12, 2026
36 of 37 checks passed
github-actions Bot pushed a commit that referenced this pull request May 12, 2026
Add BE config variant_enable_duplicate_json_path_check to keep the first
duplicate JSON object key during Variant parsing when enabled.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by one committer. dev/4.1.x usercase Important user case type label

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants